IMDb is a popular online platform designed to help people review, research, and explore the world of movies. Movies are a large part of most people's lives: people regularly consume content featured on IMDb and similar websites. The film industry is a multi-billion dollar business that employs thousands of people, all of whom are invested in seeing their projects succeed financially. Yet movies frequently fail to earn back their budgets, let alone turn a profit. A team of hundreds can work tirelessly on a movie for years only for it to flop, while another film recoups its budget many times over in its first showings in theaters or on streaming platforms. The film industry is therefore heavily invested in identifying the qualities of a movie that correlate with strong revenue generation. Determining these money-making qualities is not just a million-dollar question but potentially a multi-billion dollar one. This data exploration and analysis works through the IMDb data, along with separate datasets that rank the most popular directors, actors, and actresses, to determine what successful movies have in common and whether it is possible to predict the revenue range a movie will generate based on historical data.
The Main IMDb Dataset - IMDB data from 2006 to 2016:
Found here: https://www.kaggle.com/PromptCloudHQ/imdb-data/code
The IMDb data from 2006 to 2016 we are taking into consideration has been divided into several columns:
We sourced from these two main websites to create new attribute columns:
As part of our ML analysis, we introduce new attribute columns: the popularity of directors, actors, actresses, and genres, as metrics to measure the contribution of the cast, director, and genres in determining the revenue generated by a given movie. We did this because the original columns are qualitative and cannot be used directly in our ML algorithms. We therefore sourced separate datasets to assign a rank or point value to each director and actor, averaged those values, and created quantitative columns for the ML models.
Below are the keys created from two main sources, IMDB lists and YouGovAmerica polls, that rank directors and actors. Lastly, the genre key was created by summing movie revenues by genre and ranking the top-grossing genres:
And these are the new attribute columns created using the keys. For each movie, we loop through its set of actors, grab their associated ranks/points, and average them by group size; for directors we simply grab the single associated rank/points. This creates the following columns:
Revenue Per IMDb Average Cast Points
Revenue Per YGA Average Cast Ranks
Revenue Per IMDb Average Director Points
Revenue Per YGA Director Rank
Revenue Per Average Genre Rank
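To make the averaging mechanism concrete before the implementation later in the notebook, here is a minimal sketch using hypothetical names and point values (the real keys are built from the IMDB and YGA sources below):

```python
import pandas as pd

# Hypothetical key: cast member -> popularity points (names and values are illustrative only)
cast_key = pd.DataFrame({"Name": ["Actor A", "Actor B", "Actor C"],
                         "Points": [1200, 800, 400]})

def avg_cast_points(actors, key):
    """Average the key's points over the cast members found in the key; 0 if none match."""
    pts = key.loc[key["Name"].isin(actors), "Points"]
    return pts.mean() if len(pts) else 0

# A cast with two names in the key and one missing: (1200 + 400) / 2 = 800.0
print(avg_cast_points(["Actor A", "Actor C", "Unknown Actor"], cast_key))
```

Cast members missing from a key are simply skipped, which matches how the real key-lookup functions later in the notebook handle incomplete coverage.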
We did some preliminary research on previous projects that have looked at this or similar questions and came across some interesting papers and articles. We used this as a basis for our approach, and insights from that research are embedded throughout this notebook.
The data is read from the CSV file into a pandas DataFrame. All rows with missing revenue values are dropped, since revenue is one of the main measures of success being explored; it does not make sense to keep those rows, and the remaining values are easier to work with. Each column is then cast to its appropriate type. For example, 'Rank', 'Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', and 'Metascore' should all be numeric, while 'Genre' and 'Actors' should be lists of strings.
#!pip install pdfplumber
# !pip install plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import re
import pdfplumber
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')
#Reads in the IMDB Movie Data CSV file and removes all the empty values for revenue and resets the index
movie_data = pd.read_csv("IMDB-Movie-Data.csv")
movie_data = movie_data[pd.notnull(movie_data['Revenue (Millions)'])].reset_index()
movie_data.drop(columns=['index'], inplace = True)
#Changes the appropriate columns to the correct type and creates a list of genres and actors
change_types = ['Rank', 'Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']
for i in change_types:
    movie_data[[i]] = movie_data[[i]].apply(pd.to_numeric)
movie_data['Genre'] = movie_data['Genre'].apply(lambda x: x.split(','))
movie_data['Actors'] = movie_data['Actors'].apply(lambda x: re.split(r',\s?', x))
movie_data.head()
The first aspect explored was whether one genre clearly generated the most revenue. To do this, a dictionary was used to track each genre's total revenue. Some movies have multiple genres; to account for that, each of a movie's genres was credited with the movie's entire revenue. The results were then put into a pie chart to best visualize the data.
#Dictionary to get each genre's total revenue
genre_list = {}
total_sum = sum(movie_data['Revenue (Millions)'])
#Goes through and gets each genre's revenue and adds it to the dictionary
for i in range(len(movie_data['Genre'])):
    for j in movie_data.loc[i,'Genre']:
        if j in genre_list:
            genre_list[j] = genre_list[j] + movie_data.loc[i,'Revenue (Millions)']
        else:
            genre_list[j] = movie_data.loc[i,'Revenue (Millions)']
#DataFrame created holding each genre and their total revenue for the pie chart
df = pd.DataFrame()
df['Genres']=genre_list.keys()
df['Revenue'] = genre_list.values()
#Plots the dataframe as a pie chart and displays the percent, label name and genres
fig = px.pie(df, values='Revenue', names='Genres', title='Revenue By Genre')
fig.update_traces(textinfo='percent+label+value', textposition='inside')
fig.show()
From this graph it is clear that the largest genre is Adventure, which generated around $38,852.61 million. It accounted for about 19.5% of all revenue across all movies. The top three genres are Adventure, Action, and Drama, and together they account for nearly half of all the revenue generated. According to the data, then, a movie that includes one or more of these top three genres is more likely to generate substantial revenue, as shown by the trend in the pie chart.
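The top-three share can be computed directly from the per-genre totals; a small sketch, using illustrative revenue figures in place of the notebook's actual genre_list values:

```python
# Illustrative stand-in for the genre -> total revenue ($M) dictionary built above
genre_totals = {"Adventure": 38852.61, "Action": 35000.0, "Drama": 23000.0,
                "Comedy": 20000.0, "Horror": 8000.0}

# Sort genres by revenue, take the top three, and compute their share of the total
top3 = sorted(genre_totals.items(), key=lambda kv: kv[1], reverse=True)[:3]
share = sum(rev for _, rev in top3) / sum(genre_totals.values())
print([g for g, _ in top3], round(share, 3))
```

Running the same two lines on the real genre_list reproduces the "nearly half" figure quoted above.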
The next aspect explored was how revenue changed over the years and whether a conclusion could be drawn from the data. For each year, all the movies' revenues were summed and plotted on a line graph to look for a clear trend.
revenue = []
for i in np.unique(movie_data['Year']):
    revenue.append(sum(movie_data[movie_data['Year'] == i]['Revenue (Millions)']))
plt.plot(np.unique(movie_data['Year']), revenue)
plt.xlabel('Years')
plt.ylabel('Revenue (Millions)')
plt.title('Revenue Per Year')
plt.show()
It can be seen from the line graph above that as the years go on, the revenue generated generally increases. There was an especially sharp increase from 2015 to 2016. There is also a significant dip in revenue between 2010 and 2011, followed by a sharp increase the year after. Despite these fluctuations, there is a clear general increase in revenue over the years.
One factor to consider is how ratings changed over the years, as a gauge of how movies generally did. To do this, the ratings for each year were averaged and plotted on a scatterplot. To help show whether there is a general trend, a linear regression line was added, calculated with the polyfit function.
#Gets the unique years to be plotted
x = np.unique(movie_data['Year'])
y = []
for i in x:
    temp = movie_data[movie_data['Year'] == i]
    y.append(sum(temp['Rating'])/len(temp))
    plt.scatter(i, y[-1], c = 'b')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x+b, c = 'r')
plt.xlabel('Years')
plt.ylabel('Ratings')
plt.title('Average Ratings Per Year')
plt.show()
The graph above shows a general downward trend between 2006 and 2016: as the years go on, the average rating decreases. It should be noted, however, that this is not a huge decrease. The average rating dropped by less than one point, which indicates ratings have not fallen significantly, but a general downward trend between 2006 and 2016 remains.
Since revenue and ratings have each been explored per year, the relationship between the two variables is worth exploring to see whether ratings and revenues are related. The dataset contains two types of ratings. The first is the one previously explored, in the column labeled 'Rating'. The other is the Metascore, a rating from the Metacritic website. Both ratings' relationships with revenue are explored in this section. To build this graph, the code below creates rating categories of '0' through '10' plus "No Data". To fit each category, each rating is floored, so 4.3 and 4.9 both fall in the 4 category. Since the Metascore is out of 100, it is divided by 10 to put it on the same scale as 'Rating' before flooring. The data is then plotted side by side as a bar graph for ease of comparison. It is important to note that each category in this graph is a range (4 spans 4.0-4.9).
import math
rev_per_rating = [0] * 12
rev_per_meta = [0] * 12
for index in range(len(list(movie_data['Rating']))):
    rev_per_rating[math.floor(movie_data.loc[index,'Rating'])] += movie_data.loc[index,'Revenue (Millions)']
    if math.isnan(movie_data.loc[index,'Metascore']):
        rev_per_meta[-1] += movie_data.loc[index,'Revenue (Millions)']
    else:
        rev_per_meta[math.floor((movie_data.loc[index,'Metascore'])/10)] += movie_data.loc[index,'Revenue (Millions)']
plt.bar(np.arange(12), rev_per_rating, 0.35, color = 'b', label = "Ratings")
plt.bar(np.arange(12)+0.35, rev_per_meta, 0.35, color = 'r', label = "Metascores")
plt.xlabel('Ratings')
plt.ylabel('Revenue (Millions)')
plt.title('Revenue Per Ratings')
plt.xticks(np.arange(12)+0.35 / 2,['0','1','2','3','4','5','6','7','8','9','10',"No Data"])
plt.legend(loc = 'best')
plt.show()
From this bar graph, it can clearly be seen that most 'Rating' values fall between 6 and 8 (inclusive), whereas the 'Metascore' values form a curve with most ratings falling between 4 and 8 (inclusive). Movies whose rating was between 6 and 7 generated the most total revenue for both types of ratings. So even though other movies received higher ratings, their total revenue was still lower than that of the movies in the 6-7 range.
Another contributing factor to a movie's success could be a specific director. To analyze this, each director is checked for how much revenue their movies have made. The directors are then sorted in descending order, so that the director who generated the most revenue across all their movies is first and the one who generated the least is last. Only the top 50 are plotted, to see the relationship between successful directors and revenue.
director_list = {}
for i in range(len(movie_data['Director'])):
    director = movie_data.loc[i,'Director']
    if director in director_list:
        director_list[director] = director_list[director] + movie_data.loc[i,'Revenue (Millions)']
    else:
        director_list[director] = movie_data.loc[i,'Revenue (Millions)']
sorted_directors = sorted(director_list.items(), key=lambda x: x[1], reverse = True)
sorted_directors = sorted_directors[:50]
for x,i in enumerate(sorted_directors):
    plt.scatter(x,i[1], label = i[0] + " = " + str(round(i[1],2)) + " million")
plt.title("Revenues For the Top 50 Directors")
plt.xlabel("Directors")
plt.ylabel("Revenue (Millions)")
plt.legend(bbox_to_anchor=(1.0, 1.0), title="Legend", loc='upper left')
plt.show()
From the graph above, the general shape looks like an exponential decrease. Among the few directors who generated the most revenue, the difference in revenue is large compared to those at the bottom of the top 50. Certain directors thus appear more successful than others and can increase a movie's chances of success. It is also important to note that this does not take into account the number of movies each director made, which could influence the total revenue generated.
As mentioned in 2.6.1, the revenue the directors generated could be influenced by the number of movies they made. For example, a director with one high-grossing movie and one that generated almost nothing could rank higher than a director with a single solidly successful movie. Averaging each director's total revenue over the number of movies they made gives a better picture of how successful each director was, which can then be analyzed as a possible factor in what makes a movie successful. As in 2.6.1, each director's movie revenues are summed, but this time divided by the number of movies the director made. This is then sorted in descending order and the top 50 are plotted.
d_list = {}
for i in np.unique(movie_data['Director']):
    tmp = movie_data[movie_data['Director'] == i]
    d_list[i] = sum(tmp['Revenue (Millions)'])/len(tmp)
sorted_directors = sorted(d_list.items(), key=lambda x: x[1], reverse = True)
sorted_directors = sorted_directors[:50]
for x,i in enumerate(sorted_directors):
    plt.scatter(x,i[1], label = i[0] + " = " + str(round(i[1],2)) + " million")
plt.title("Average Revenues For the Top 50 Directors")
plt.xlabel("Directors")
plt.ylabel("Average Revenue (Millions)")
plt.legend(bbox_to_anchor=(1.0, 1.0), title="Legend", loc='upper left')
plt.show()
As shown in this graph, the top directors have changed compared to the previous graph in 2.6.1. The curve still decreases roughly exponentially, but it now gives a better picture of which directors are most successful and how much they averaged per movie. The large differences among the top directors in the top 50 suggest that these directors make successful movies and perform better than the directors below them.
This section is about creating the key t-charts that match actors, actresses, directors, and genres to their rank or points from the two databases, IMDB and YGA (YouGovAmerica). These keys will be used to walk through the lists of actors, directors, and genres in order to average or match their points/rank. Essentially, qualitative data like names and genre types becomes quantitative through these keys, which should help with ML prediction.
These are the two key dataframes produced from these datasets (CSVs located in the GitHub repository):
'''
Read the IMDB datasets and create two keys, one for actors and one for directors.
Now, each director/actor is listed along with a point value representing their fame, according to IMDB.
'''
actor_data = pd.read_csv("imdb_actors.csv", encoding='latin-1')
actress_data = pd.read_csv("imdb_actresses.csv", encoding='latin-1')
director_data = pd.read_csv("imdb_directors.csv", encoding='latin-1')
def to_num(lst):
    ret = []
    for i in range(len(lst)):
        x = re.search(r"(\d+)[^\d]*", lst[i])
        ret.append(int(x.group(1)))
    return ret
actor_data['Description'] = to_num(actor_data['Description'])
actor_data.rename(columns={"Description": "Points"}, inplace = True)
actor_data = actor_data.filter(['Points', 'Name'])
actress_data['Description'] = to_num(actress_data['Description'])
actress_data.rename(columns={"Description": "Points"}, inplace = True)
actress_data = actress_data.filter(['Points', 'Name'])
imdb_cast_key = pd.concat([actor_data, actress_data])
imdb_cast_key = imdb_cast_key.sort_values('Points', ascending = False).reset_index(drop=True)
director_data['Description'] = to_num(director_data['Description'])
director_data.rename(columns={"Description": "Points"}, inplace = True)
imdb_director_key = director_data.filter(['Points', 'Name'])
display(imdb_cast_key.head())
display(imdb_director_key.head())
These are the two key dataframes produced from these datasets (PDFs located in the GitHub repository):
'''
Read the YGA datasets, which are given in the form of PDFs. Web scraping proved too hard, as the website is complex
with JavaScript interactions. Various pdf-to-text tools were tested and pdfplumber proved to be the best.
We used pdfplumber to read text from the PDFs and transfer those rankings into dataframe keys.
Now, each director/actor is listed along with a value representing their fame, according to YGA.
'''
def pdf_to_df(file):
    pdfdump = ""
    with pdfplumber.open(file) as pdf:
        for page in range(0, 6):
            pdfdump += pdf.pages[page].extract_text()
            pdfdump += "\n"
    key = {"Name":[], "Rank":[], "Fame(%)":[], "Popularity(%)":[]}
    for line in pdfdump.splitlines():
        match = re.search(r"(\d+)\s(.*)\s(\d+)%\s(\d+)%", line)
        if match:
            key["Name"].append(match.group(2))
            key["Rank"].append(match.group(1))
            key["Fame(%)"].append(match.group(3))
            key["Popularity(%)"].append(match.group(4))
    return pd.DataFrame(key)
yga_cast_key = pdf_to_df('yga_cast.pdf')
yga_director_key = pdf_to_df('yga_directors.pdf')
display(yga_cast_key.head())
display(yga_director_key.head())
This key dataframe is produced from the genre column of the original dataset.
sorted_genres = sorted(genre_list.items(), key=lambda x: x[1], reverse=True)
key = {"Genre":[], "Rank":[], "Total Revenue":[]}
count = 1
for (k, v) in sorted_genres:
    key['Genre'].append(k)
    key['Rank'].append(count)
    key['Total Revenue'].append(v)
    count += 1
genre_key = pd.DataFrame(key)
genre_key.head()
Now we will use the keys generated above to create the five new attribute columns.
'''
The purpose of these functions is to convert the qualitative columns ('Actors', 'Director', 'Genre') into
quantitative values that can be processed by the ML algorithms.
'''
# This function will be used for actors and genre as they are originally of list type.
def avg_lst_keys(x, key_data, lst_label, x_label, y_label):
    lst = x[lst_label]  # Get the list of actors or genres
    total = 0
    count = 0
    for obj in lst:  # Iterate through the list of objects
        if not key_data[key_data[x_label] == obj].empty:  # Check whether this object exists in the passed-in key
            pts = int(key_data.loc[key_data[x_label] == obj, y_label].iloc[0])  # grab the object's points/rank
            total += pts  # add it to the overall sum
            count += 1  # track the number of objects found in the key
        else:  # Object missing from the key dataset; some actors are unfortunately missing from the two datasets
            missing_obj.add(obj)  # Track the missing objects
    if count:
        return total/count  # final average: divide the sum by the number of found objects
    else:
        return 0  # no objects from the list exist in the key
# This function will be used for directors as there is only one recorded director per movie.
def match_key(x, key_data, label):
    if not key_data[key_data['Name'] == x['Director']].empty:  # Check whether this director exists in the key
        return int(key_data.loc[key_data['Name'] == x['Director'], label].iloc[0])  # return the director's points/rank
    else:  # Director missing from the key dataset; some directors are unfortunately missing from the two datasets
        missing_obj.add(x['Director'])  # Track the missing objects
        return 0
'''
Calls to the functions above. We transform actors and directors using the two datasets, IMDB and YGA.
Next, we transform genre with the genre rank key. Lastly, we print out the number of missing
actors/directors across both data sources, in order to understand the effect this has on data accuracy.
A more complete dataset would have been preferable, but these are currently the best ones available.
'''
print("These are the number of cast members and directors that are missing from either datasources:")
# IMDB Cast Members Average Points Calculated
missing_obj = set()
movie_data['imdb_avg_cast_pts'] = movie_data.apply(lambda x: avg_lst_keys(x, imdb_cast_key, "Actors", "Name", "Points"), axis = 1)
print("# of IMDB Missing Cast Members: ", len(missing_obj))
# YGA Cast Members Average Rank Calculated
missing_obj = set()
movie_data['yga_avg_cast_rank'] = movie_data.apply(lambda x: avg_lst_keys(x, yga_cast_key, "Actors", "Name", "Rank"), axis = 1)
print("# of YGA Missing Cast Members: ", len(missing_obj))
# IMDB Director Average Points Calculated
missing_obj = set()
movie_data['imdb_avg_director_pts'] = movie_data.apply(lambda x: match_key(x, imdb_director_key, "Points"), axis = 1)
print("# of IMDB Missing Directors: ", len(missing_obj))
# YGA Director Rank Calculated
missing_obj = set()
movie_data['yga_avg_director_rank'] = movie_data.apply(lambda x: match_key(x, yga_director_key, "Rank"), axis = 1)
print("# of YGA Missing Directors: ", len(missing_obj))
# Genre Average Rank Calculated
missing_obj = set()
movie_data['genre_avg_rank'] = movie_data.apply(lambda x: avg_lst_keys(x, genre_key, "Genre", "Genre", "Rank"), axis = 1)
print("# of Missing Genres: ", len(missing_obj))
In this part, the relationships between revenue and "IMDB Average Cast Points", "YGA Average Cast Ranks", "IMDB Average Director Points", "YGA Average Director Rank", and "Average Genre Rank" will be explored. The purpose is to see how each rank or point value compares against the revenue generated.
'''
Plot the newly created attribute columns against revenue to see if there is a relationship between cast, director, or
genre and revenue. This gives an idea of the effectiveness of this addition to the data, including whether, in general,
Hollywood performs as we would expect, where famous directors and actors create higher-grossing films and vice versa.
'''
avgs = ['imdb_avg_cast_pts', 'yga_avg_cast_rank', 'imdb_avg_director_pts', 'yga_avg_director_rank', 'genre_avg_rank']
avg_name = ["IMDB Average Cast Points", "YGA Average Cast Ranks", "IMDB Average Director Points", "YGA Average Director Rank", "Average Genre Rank"]
for i,a in zip(avgs,avg_name):
    for j in range(len(movie_data[i])):
        plt.scatter(movie_data.loc[j,i], movie_data.loc[j,'Revenue (Millions)'], c = 'b')
    plt.title("Revenue Per " + a)
    plt.xlabel(a)
    plt.ylabel('Revenue (Millions)')
    plt.show()
In general, no graph between rank and revenue (whether it originates from IMDb or YGA, or concerns directors or actors) showed a strong relationship.
Lower rank (1, 2, 3, etc.) -> more popular directors, actors, or genres (data comes from the YGA dataset)
Higher points (around 14,000) -> more popular directors or actors (data comes from the IMDB dataset)
We would expect that as rank increases, revenue decreases, and as points increase, revenue increases. However, this does not always seem to be the case, as shown by the graphs above. In many cases there are outliers suggesting that less popular actors/directors make higher-grossing films and vice versa, which is contradictory yet an interesting find.
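One way to quantify these expected relationships, rather than eyeballing the scatterplots, is a Pearson correlation of each new column against revenue. A sketch with simulated data standing in for the real columns (the column names match those created above; the values and coefficients here are illustrative only):

```python
import numpy as np
import pandas as pd

# Simulated stand-ins for two of the new attribute columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "imdb_avg_cast_pts": rng.uniform(0, 14000, 200),  # higher points -> more popular
    "yga_avg_cast_rank": rng.uniform(1, 100, 200),    # lower rank -> more popular
})
# Simulated revenue built to follow the expected direction of each relationship, plus noise
demo["Revenue (Millions)"] = (0.01 * demo["imdb_avg_cast_pts"]
                              - 0.5 * demo["yga_avg_cast_rank"]
                              + rng.normal(0, 30, 200))

# If the expectation holds, points correlate positively with revenue and rank negatively
corrs = demo[["imdb_avg_cast_pts", "yga_avg_cast_rank"]].corrwith(demo["Revenue (Millions)"])
print(corrs)
```

The same corrwith call on the real movie_data gives a one-line check of whether the weak trends in the plots are positive or negative at all.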
This part revisits the graphs above and provides a fix to better understand the previous relationships. It is easier to draw conclusions by dividing the data into 100 sections and plotting one point, representing the average, for each section. A regression line is added to show the general trend of the data.
for i,a in zip(avgs,avg_name):
    x = np.linspace(0, max(movie_data[i]),100)
    m = []
    n = []
    for j in range(99):
        tempdf = movie_data[(movie_data[i]>=x[j]) & (movie_data[i]<=x[j+1])]
        if sum(tempdf['Revenue (Millions)']) > 0:
            m.append(x[j])
            n.append(sum(tempdf['Revenue (Millions)'])/len(tempdf))
            plt.scatter(m[-1], n[-1], c = 'b')
    line_parts = np.polyfit(m, n, 1)
    line = np.poly1d(line_parts)
    y_part = line(m)
    plt.plot(m, y_part, c = 'r')
    plt.title("Revenue Per " + a)
    plt.xlabel(a)
    plt.ylabel('Revenue (Millions)')
    plt.show()
As seen in the graphs above, the results were as expected, but not as strongly correlated as one might hope: generally, the higher the rankings and points, the higher the movie's revenue.
'''
In preparation for ML, we must create a categorical column. The 'Revenue (Millions)' column as it exists now is
quantitative but continuous, which does not suit the classifiers used below. Therefore, we make it categorical by
splitting the revenue column into range groups, or classes. The ranges were chosen using data.describe()'s spread
statistics in order to split the data roughly evenly.
Class 1: 0-15 Million in Revenue
Class 2: 16-50 Million in Revenue
Class 3: 51-115 Million in Revenue
Class 4: 116-1000 Million in Revenue
'''
groups = [1, 2, 3, 4] # Real groups represented by the numbers here are: '0-15', '16-50', '51-115','116-1000'
bins = [-1, 15, 50, 115, 1000]
# use cut function and above defined bins/labels to create new group column in master df
movie_data['revenue_groups'] = pd.cut(x= movie_data['Revenue (Millions)'], bins=bins, labels=groups)
movie_data['revenue_groups'] = movie_data['revenue_groups'].astype(int)
print("Split of the revenue into range groups:")
print (movie_data['revenue_groups'].value_counts())
'''
Preparing the data for ML by dropping all qualitative columns and filtering out extreme values.
Transitioning over to a new version of the original dataset - movie_data2.
'''
movie_data2 = movie_data.filter(['Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Metascore', 'Revenue (Millions)', 'imdb_avg_cast_pts', 'yga_avg_cast_rank', 'imdb_avg_director_pts', 'yga_avg_director_rank', 'genre_avg_rank', 'revenue_groups'])
movie_data2.replace([np.inf, -np.inf], np.nan, inplace=True) # replace extreme values
movie_data2.fillna(0, inplace=True) # replace NaN values with 0 in order to create valid integer/float columns for ML
movie_data2.reset_index(drop=True, inplace=True)
movie_data2.head()
'''
Setting the ML models and their hyperparameters to be tested. These are the 3 best models that were run; others were
tested, including logistic regression, linear SVM, etc.
'''
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
models = {
    'K - Nearest Neighbors': {'type': KNeighborsClassifier(),
                              'params': [{'n_neighbors': [1, 3, 5, 10, 50], 'leaf_size': [3, 30]}]},
    'Decision Tree': {'type': tree.DecisionTreeClassifier(),
                      'params': [{'max_depth': [3, None]}]},
    'Random Forest': {'type': RandomForestClassifier(),
                      'params': [{'n_estimators': [500]}]}
}
'''
The main ML training script. First, split the data into train and test sets with an 80-20 split.
Then, create a results dataframe that will eventually showcase results: accuracy score, precision, and other statistics.
Next, loop through each model declared above and use GridSearchCV to run every potential combination of
parameters for the best possible outcome, attaching those results to the dataframe. All of this is also timed,
to track training efficiency. Lastly, calculate statistics from a confusion matrix and showcase those as well.
'''
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
# !pip install pycm
from pycm import ConfusionMatrix
from sklearn.metrics import accuracy_score
# Defining the X and Y(target) values
X = movie_data2.drop(['revenue_groups','Revenue (Millions)'], axis=1)
Y = movie_data2['revenue_groups']
# Split 80-20 train and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
print(len(X_train),'samples in training data and', len(X_test),'samples in test data\n')
# dataframe to write out comparison points of each models
df_results = pd.DataFrame(
    data=np.zeros(shape=(3,8)), # one row per model
    columns = ['classifier',
               'best_params',
               'train_score',
               'test_score',
               'Accuracy by Class',
               'Precision by Class',
               'FPR by Class',
               'FNR by Class'])
# Loop through dictionary defined previously, fit two models, score data, create confusion matrix, do calculations and print out results
count = 0
for name, model in models.items():
    t_start = time.process_time()
    grid = GridSearchCV(model['type'], model['params'], refit=True, cv = 10, scoring = 'accuracy')
    estimator = grid.fit(X_train, Y_train)
    t_end = time.process_time()
    t_diff = t_end - t_start
    # score the fitted model and save predictions for the test dataset
    train_score = estimator.score(X_train, Y_train)
    test_score = estimator.score(X_test, Y_test)
    Y_pred = estimator.best_estimator_.predict(X_test)
    cm = ConfusionMatrix(actual_vector=Y_test.to_numpy(), predict_vector=Y_pred) # confusion matrix for the comparison points below
    FP = np.array(list(cm.FP.values()))
    FN = np.array(list(cm.FN.values()))
    TP = np.array(list(cm.TP.values()))
    TN = np.array(list(cm.TN.values()))
    PPV = TP/(TP+FP) # Precision, or positive predictive value, for each target class
    FPR = FP/(FP+TN) # False positive rate for each target class
    FNR = FN/(TP+FN) # False negative rate for each target class
    ACC = (TP+TN)/(TP+FP+FN+TN) # Accuracy for each target class
    # Set the generated results into the results dataframe
    df_results.loc[count,'classifier'] = name
    df_results.loc[count,'best_params'] = str(estimator.best_params_)
    df_results.loc[count,'train_score'] = train_score
    df_results.loc[count,'test_score'] = test_score
    df_results.loc[count,'Accuracy by Class'] = str(ACC)
    df_results.loc[count,'Precision by Class'] = str(PPV)
    df_results.loc[count,'FPR by Class'] = str(FPR)
    df_results.loc[count,'FNR by Class'] = str(FNR)
    df_results.loc[count, '10-fold CV error estimate (w/ stderr)'] = (estimator.cv_results_.get('std_test_score').mean())/(math.sqrt(10))
    print("trained {c} in {f:.2f} s".format(c=name, f=t_diff))
    count += 1
display(df_results.sort_values(by='test_score', ascending=False)) # View each model's results side by side to compare attributes
The best performing machine learning models were K-Nearest Neighbors, Random Forest, and Decision Tree. Before adjusting the hyperparameters, we saw test scores averaging around 30% for each model, and around 50-60% afterwards. As an example, the hyperparameters adjusted for K-Nearest Neighbors were the number of neighbors and the leaf size. Though our models only average 50-60% on test scores, they are still somewhat useful. In future analysis, we would hope to have more data in general as well as additional attribute columns. For example, a useful attribute would be each movie's budget, since budget is an excellent indicator of how much money film investors might expect to profit, or even the number of theaters scheduled to show the movie on opening night. Including this information, along with any additional pre-launch data, would lead to much better prediction accuracy and stronger models overall.
We believe that our analysis and machine learning predictions are a good launching point for a further deep dive into this use case. Movie revenue prediction could become a very important factor in Hollywood and this notebook represents the tip of that venture. Thank you!!
This dataset was found near the end of our project, because by then we knew what to look for. This Kaggle competition and dataset could be a great next step for this type of project. The dataset is much more detailed and includes more attributes, such as budget, along with interesting others like poster and languages. It is also much larger and adds a new level of nuance because it is a global dataset with movies from all over the world. This could introduce new factors to account for, like the popularity of the various global movie industries themselves. For example, Bollywood is much larger than Hollywood in terms of revenue, audience, etc., so this would factor into revenue prediction.